Performance of Content Based Mining Approach for Multi-lingual Textual Data

Authors

  • Kolla Bhanu Prakash
  • M.A.Dorai Rangaswamy
  • Arun Raja Raman
Abstract

Data mining has become a necessary and powerful tool in the present era of web and internet communications. It has also evolved into media mining, in which heterogeneous inputs such as figures, video and audio are increasingly embedded in web pages, making the task more complex. These and other aspects, such as the currency and ‘liveliness’ of the web, shift the emphasis from translation to content extraction. Content extraction from web pages whose parent language is English or an Indian regional language has to deal with the free use of one language inside another (for example, ‘computer’ used as-is within regional text) and with the inclusion of other forms of data such as hand-written text, sketches or drawings. Such mixing is common in education, news and entertainment, and this paper focuses on extracting content from hybrid documents with embedded hand-written text. Initial work was carried out on web documents containing computer-generated text, since such text is crisp in nature. Extending that idea, this paper presents results for hand-written text, which is less crisp and more fuzzy depending on the writer, and compares them with results for computer-generated text. Results for features computed on pixel maps are presented first, progressing from letters with common content to words with common content. Extraction using normalisation studies and classification is presented afterwards.

Similar resources

A multilingual text mining approach to web cross-lingual text retrieval

To enable concept-based cross-lingual text retrieval (CLTR) using multilingual text mining, our approach will first discover the multilingual concept–term relationships from linguistically diverse textual data relevant to a domain. Second, the multilingual concept–term relationships, in turn, are used to discover the conceptual content of the multilingual text, which is either a document contai...

The Role of Hubs in Cross-Lingual Supervised Document Retrieval

Information retrieval in multi-lingual document repositories is of high importance in modern text mining applications. Analyzing textual data is, however, not without associated difficulties. Regardless of the particular choice of feature representation, textual data is high-dimensional in its nature and all inference is bound to be somewhat affected by the well known curse of dimensionality. I...

English-Persian Plagiarism Detection based on a Semantic Approach

Plagiarism, which is defined as “the wrongful appropriation of other writers’ or authors’ works and ideas without citing or informing them”, poses a major challenge to knowledge spread publication. Plagiarism has been placed in four categories: direct, paraphrasing (rewriting), translation, and combinatory. This paper addresses translational plagiarism which is sometimes referred to as cross-li...

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

A Multi-Objective Approach to Fuzzy Clustering using ITLBO Algorithm

Data clustering is one of the most important areas of research in data mining and knowledge discovery. Recent research in this area has shown that the best clustering results can be achieved using multi-objective methods. In other words, assuming more than one criterion as objective functions for clustering data can measurably increase the quality of clustering. In this study, a model with two ...


Publication date: 2011